This study is part of the MACRO project.
Figure 0.1: MACRO project logo
Information on the spatial distribution of hydrogeochemical parameters is crucial for decision making. Machine-learning-based methods for mapping hydrogeochemical parameter concentrations have been studied for many years as a step beyond deterministic and geostatistical interpolation methods. However, it is often difficult to represent all relevant processes on which the target variable depends, because suitable features are frequently insufficiently determined and/or unavailable. This is especially true when only freely accessible data are used.
In this study, we apply an extreme gradient boosting learner (XGB) to map major ion concentrations across Germany. The training data consist of water samples from approximately 50K observation wells across Germany and a wide range of environmental data as predictors. The water samples were collected between the 1950s and 2005 at anthropogenically undisturbed locations.
The environmental data include hydrogeological units and parameters, soil type, lithology, a digital elevation model (DEM), and DEM-derived parameters, among others. The values of these features at each water sample location were extracted on the basis of a polygon that approximately represents the area influencing the target variable (ion concentration). For comparison, different polygon shapes are used.
The model was set up as chained multioutput regression, meaning that the prediction of the previous model in a linear sequence of single-output models is used as input for the subsequent model.
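The chaining idea can be illustrated with a minimal sketch. It is written in Python purely for compactness (the study itself used R), and it makes several assumptions: a toy 1-nearest-neighbour learner stands in for XGBoost, the function names are hypothetical, and predictions accumulate along the chain, which is one common reading of the description above. Whether the study passes on only the immediately preceding prediction is not stated.

```python
from typing import Callable, List, Sequence

Row = List[float]
Predictor = Callable[[Row], float]


def fit_nearest_neighbour(X: Sequence[Row], y: Sequence[float]) -> Predictor:
    """Toy single-output learner (1-nearest neighbour), a stand-in for XGBoost."""
    X = [list(r) for r in X]
    y = list(y)

    def predict(row: Row) -> float:
        dists = [sum((a - b) ** 2 for a, b in zip(r, row)) for r in X]
        return y[dists.index(min(dists))]

    return predict


def fit_chain(X, Y, fit=fit_nearest_neighbour):
    """Fit one model per target column; each later model also receives the
    predictions of all earlier models as additional input columns."""
    X_aug = [list(r) for r in X]
    models = []
    for j in range(len(Y[0])):
        model = fit(X_aug, [row[j] for row in Y])
        models.append(model)
        preds = [model(r) for r in X_aug]
        X_aug = [r + [p] for r, p in zip(X_aug, preds)]
    return models


def predict_chain(models, row):
    """Predict all targets for one sample by walking along the chain."""
    row_aug = list(row)
    outputs = []
    for model in models:
        p = model(row_aug)
        outputs.append(p)
        row_aug.append(p)  # hand the prediction to the next model
    return outputs


# Hypothetical data: one feature, two targets.
X = [[0.0], [1.0], [2.0]]
Y = [[10.0, 100.0], [20.0, 200.0], [30.0, 300.0]]
models = fit_chain(X, Y)
chain_prediction = predict_chain(models, [1.1])
```

The order of the targets in the chain matters, because a model can only use the predictions of targets that precede it.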
The results are planned to serve for a comparison with state-of-the-art deep learning architectures.
All data processing was done in the R programming language using multiple packages (see chapter References).
The training data used in this study are based on a dataset containing approximately 53000 measurements of hydrogeochemical parameters from groundwater samples, predominantly taken during the second half of the 20th century and up to 2010-01-01. The number of samples used for training the models was reduced by the following steps:
Intersection with Study Area
The sample locations were intersected with the administrative border of Germany.
Filter by Sample Date
Only samples between 100 and 2010-01-01 were kept in the dataset. The distribution of the sample date after applying the previous processing steps is shown in figure 2.1.
Figure 2.1: Distribution of sample date
Filter by Sample Depth
Only samples up to a depth of 100 were kept in the dataset. The distribution of the sample depth after applying the previous processing steps is shown in figure 2.2.
Figure 2.2: Distribution of sample depth
Aggregation of multiple measurements per sample site
Some of the sample sites have multiple measurements over time. These values were aggregated by calculating the mean. The distribution of the number of measurements per sample site after applying the previous processing steps is shown in figure 2.3.
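A minimal sketch of this aggregation step is given below, in Python for compactness (the study used R). The record layout and the function name are hypothetical, and skipping missing values when averaging is an assumption, since the text does not state how missing values were handled at this stage.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_site(records, targets):
    """Average repeated measurements per station, skipping missing values (None)."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["station_id"]].append(rec)

    result = {}
    for station_id, recs in grouped.items():
        row = {}
        for t in targets:
            values = [r[t] for r in recs if r.get(t) is not None]
            # A target with no valid measurement at a site stays missing.
            row[t] = mean(values) if values else None
        result[station_id] = row
    return result


# Hypothetical records: station "A" was sampled twice, station "B" once.
records = [
    {"station_id": "A", "ca_mg_l": 40.0, "cl_mg_l": None},
    {"station_id": "A", "ca_mg_l": 50.0, "cl_mg_l": 20.0},
    {"station_id": "B", "ca_mg_l": 80.0, "cl_mg_l": None},
]
site_means = aggregate_by_site(records, ["ca_mg_l", "cl_mg_l"])
```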
Figure 2.3: Distribution of multiple measurements per sample site
From all measured parameters, only the major anions and cations with the most samples were chosen as target variables to be modeled. Across all these parameters and after all preprocessing steps, 34536 samples and 12 columns remain for model training: 1 for the station ID (station_id), 1 for the sample depth (sample_depth), and 10 for the target variables.
The locations of the sample sites used for modeling are shown in figure 2.4 as the number of sample sites per hexagon. The spatial distribution of sampling locations is unbalanced: some regions have few locations, while others have a high density. The latter are mainly concentrated around larger German cities such as Berlin, Hamburg, and Frankfurt. The eastern and northern areas of Germany also generally have more sampling sites than southern or central Germany.
Figure 2.4: Sample site locations
The sample depth was calculated as the mean of the screen top and screen bottom depths where these were provided. Subsequently, the sample depth was used as a feature (predictor variable) during training.
The first three rows of the data set containing the target variables after all preprocessing steps are shown in table 2.1 as an example.
| station_id | ca_mg_l | cl_mg_l | fe_mg_l | hco3_mg_l | k_mg_l | mg_mg_l | mn_mg_l | na_mg_l | no3_mg_l | so4_mg_l |
|---|---|---|---|---|---|---|---|---|---|---|
| 110_1015 | 44.7 | 19.5 | 17.2 | 164.7 | 1.7 | 6.5 | 0.75 | 9.4 | 1.0 | 29.5 |
| 110_1016 | 82.9 | 46.0 | 15.0 | 207.4 | 6.5 | 13.9 | 2.80 | 25.0 | 0.1 | 126.0 |
| 110_1017 | 90.0 | 35.0 | 10.8 | 323.3 | 4.0 | 9.5 | 0.65 | 23.0 | 0.1 | 33.0 |
Figure 2.5 gives an overview of the missing values in this data set. The number of missing values varies between the target variables, which leads to different sample sizes when each target is modeled separately.
Figure 2.5: Missing values across the target variables
More details on the column statistics are shown in the following summary.
| Property | Value |
|---|---|
| Name | Piped data |
| Number of rows | 34536 |
| Number of columns | 10 |
| Column type frequency: numeric | 10 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ca_mg_l | 4303 | 0.88 | 88.39 | 103.78 | 0 | 40.50 | 74.20 | 112.00 | 3670.0 | ▇▁▁▁▁ |
| cl_mg_l | 3025 | 0.91 | 250.48 | 4212.52 | 0 | 13.00 | 25.00 | 45.87 | 177000.0 | ▇▁▁▁▁ |
| fe_mg_l | 13758 | 0.60 | 3.18 | 12.42 | 0 | 0.10 | 0.86 | 2.80 | 1170.0 | ▇▁▁▁▁ |
| hco3_mg_l | 4578 | 0.87 | 220.75 | 175.41 | 0 | 99.00 | 207.40 | 318.10 | 7755.7 | ▇▁▁▁▁ |
| k_mg_l | 9258 | 0.73 | 5.62 | 31.51 | 0 | 1.10 | 2.00 | 4.00 | 1440.0 | ▇▁▁▁▁ |
| mg_mg_l | 4264 | 0.88 | 17.28 | 43.32 | 0 | 5.00 | 10.10 | 19.30 | 1801.3 | ▇▁▁▁▁ |
| mn_mg_l | 9379 | 0.73 | 0.29 | 1.30 | 0 | 0.01 | 0.11 | 0.29 | 156.0 | ▇▁▁▁▁ |
| na_mg_l | 10451 | 0.70 | 180.93 | 2978.77 | 0 | 7.00 | 12.20 | 24.00 | 116000.0 | ▇▁▁▁▁ |
| no3_mg_l | 9548 | 0.72 | 14.47 | 26.88 | 0 | 0.10 | 3.00 | 18.00 | 708.0 | ▇▁▁▁▁ |
| so4_mg_l | 4276 | 0.88 | 89.93 | 212.27 | 0 | 18.60 | 43.00 | 90.50 | 6880.0 | ▇▁▁▁▁ |
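The relation between the n_missing and complete_rate columns, and the resulting per-target sample size, can be checked with a short calculation. The sketch below uses Python and takes three rows of the table above as input; the function name is illustrative only.

```python
def complete_rate(n_missing: int, n_rows: int) -> float:
    """Share of rows that have a value for a given column."""
    return 1 - n_missing / n_rows


n_rows = 34536
n_missing = {"ca_mg_l": 4303, "cl_mg_l": 3025, "fe_mg_l": 13758}

usable = {}
for name, missing in n_missing.items():
    # Rows available when this target is modeled on its own.
    usable[name] = n_rows - missing
    print(name, usable[name], round(complete_rate(missing, n_rows), 2))
```

For iron, for example, this leaves 20778 of the 34536 rows, so the single-target models are trained on noticeably different subsets of the data.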
The distribution of the target values is shown as a violin chart in figure 2.6.
Figure 2.6: Distribution of the target variable values
In addition to this dataset, geophysical attributes were extracted from other spatial data sources (see the following list) and used as features:
For every sample location, the features were extracted within a 1 km buffer that approximates the groundwater contributing area (Knoll, Breuer, and Bach 2019) (see figure 2.7). For categorical data, the proportion of each class within the buffer was calculated. An advantage of this approach is that each class is encoded as a numerical feature; a drawback is that rare classes produce many sparse features. For numerical data, the mean within the buffer was calculated.
Figure 2.7: Example of feature extraction based on circular buffer around sample sites (red) (e.g. the land use and land cover data as shown here)
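The buffer-based extraction can be sketched as follows, in Python and under simplifying assumptions: the input is a hypothetical list of raster cell centres with a land-use class and an elevation value, and class proportions are approximated by counting cells rather than by intersecting polygon areas. The function name and the data layout are illustrative, not taken from the study.

```python
from collections import Counter
from math import hypot
from statistics import mean


def extract_buffer_features(site, cells, radius=1000.0):
    """Summarise cells whose centres lie within `radius` metres of `site`.

    cells: iterable of (x, y, land_use_class, elevation) tuples (hypothetical layout).
    Returns class proportions for the categorical layer and the mean for the
    numerical layer.
    """
    inside = [c for c in cells if hypot(c[0] - site[0], c[1] - site[1]) <= radius]
    if not inside:
        return {}
    counts = Counter(c[2] for c in inside)
    features = {f"lulc_{cls}": n / len(inside) for cls, n in counts.items()}
    features["elevation"] = mean(c[3] for c in inside)
    return features


# Hypothetical cells around a site at the origin (coordinates in metres).
cells = [
    (100.0, 0.0, "agricultural", 50.0),
    (0.0, 500.0, "agricultural", 60.0),
    (-800.0, 0.0, "forest", 70.0),
    (900.0, 0.0, "forest", 80.0),
    (1500.0, 0.0, "water", 10.0),  # outside the 1 km buffer
]
buffer_features = extract_buffer_features((0.0, 0.0), cells)
```

Classes that never occur inside a buffer simply do not appear in the result; in a full feature table they would be filled with 0, which is where the sparse columns for rare classes come from.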
Extracting the features as described above results in 165 features. Summary statistics of the features are provided in table 2.3.
| Property | Value |
|---|---|
| Name | Piped data |
| Number of rows | 34536 |
| Number of columns | 165 |
| Column type frequency: numeric | 165 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| sampledepth_sampledepth | 0 | 1 | 41.71 | 36.13 | 0.00 | 17.00 | 42.00 | 50.00 | 644.00 | ▇▁▁▁▁ |
| lulc_agriculturalareas | 0 | 1 | 0.53 | 0.33 | 0.00 | 0.24 | 0.56 | 0.83 | 1.00 | ▆▅▅▆▇ |
| lulc_forestandseminaturalareas | 0 | 1 | 0.30 | 0.32 | 0.00 | 0.00 | 0.19 | 0.52 | 1.00 | ▇▂▂▂▂ |
| lulc_artificialsurfaces | 0 | 1 | 0.15 | 0.24 | 0.00 | 0.00 | 0.05 | 0.20 | 1.00 | ▇▁▁▁▁ |
| lulc_waterbodies | 0 | 1 | 0.02 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| lulc_wetlands | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.84 | ▇▁▁▁▁ |
| gwrecharge_gwrecharge | 0 | 1 | 121.53 | 71.54 | 0.00 | 73.44 | 109.89 | 158.57 | 879.34 | ▇▂▁▁▁ |
| seepage_seepage | 2 | 1 | 253.65 | 187.94 | -99.89 | 131.54 | 223.25 | 335.48 | 2559.33 | ▇▁▁▁▁ |
| temperature_temperature | 0 | 1 | 84.56 | 8.72 | 8.82 | 80.65 | 85.08 | 89.54 | 107.44 | ▁▁▁▇▆ |
| precipitation_precipitation | 0 | 1 | 719.63 | 191.05 | 453.25 | 574.28 | 691.90 | 800.75 | 2646.69 | ▇▁▁▁▁ |
| hydrounits_14 | 0 | 1 | 0.21 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| hydrounits_13 | 0 | 1 | 0.12 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_17 | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_15 | 0 | 1 | 0.14 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_63 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_31 | 0 | 1 | 0.05 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_62 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_101 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_41 | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_64 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_65 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.43 | ▇▁▁▁▁ |
| hydrounits_66 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_95 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_71 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_92 | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_96 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_32 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_52 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_12 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_11 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | ▇▁▁▁▁ |
| hydrounits_81 | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_33 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_54 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_51 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_16 | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_83 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_22 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_21 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_53 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_23 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_61 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_82 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_94 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_91 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_93 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrounits_97 | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_114 | 0 | 1 | 0.37 | 0.45 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| geology_111 | 0 | 1 | 0.14 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_113 | 0 | 1 | 0.16 | 0.35 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_115 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_223 | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_231 | 0 | 1 | 0.02 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_232 | 0 | 1 | 0.02 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_233 | 0 | 1 | 0.04 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_600 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_120 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_221 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_130 | 0 | 1 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_330 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_230 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_400 | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_222 | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_500 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_510 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_312 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_320 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_220 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_112 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_211 | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_210 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_350 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_360 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_888 | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.83 | ▇▁▁▁▁ |
| geology_333 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_332 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_331 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_300 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_311 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_212 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| geology_340 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 | ▇▁▁▁▁ |
| soilunits_19 | 0 | 1 | 0.14 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_12 | 0 | 1 | 0.03 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_28 | 0 | 1 | 0.11 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_17 | 0 | 1 | 0.11 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_33 | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_31 | 0 | 1 | 0.08 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_8 | 0 | 1 | 0.06 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_36 | 0 | 1 | 0.04 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_27 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_41 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_11 | 0 | 1 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_40 | 0 | 1 | 0.05 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_63 | 0 | 1 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_55 | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_44 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_21 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_49 | 0 | 1 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_61 | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_30 | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_54 | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_15 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_52 | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_72 | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.96 | ▇▁▁▁▁ |
| soilunits_69 | 0 | 1 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_18 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_59 | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_68 | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_4 | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_2 | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_5 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_60 | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_66 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_51 | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_34 | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_46 | 0 | 1 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_42 | 0 | 1 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| soilunits_22 | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologygc_s | 0 | 1 | 0.78 | 0.38 | 0.00 | 0.73 | 1.00 | 1.00 | 1.00 | ▂▁▁▁▇ |
| hydrogeologygc_Gew | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.82 | ▇▁▁▁▁ |
| hydrogeologygc_m | 0 | 1 | 0.13 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologygc_a | 0 | 1 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologygc_so | 0 | 1 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologygc_kA | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologygc_k | 0 | 1 | 0.06 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologygc_g | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologygc_h | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | ▇▁▁▁▁ |
| hydrogeologygc_gh | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.90 | ▇▁▁▁▁ |
| hydrogeologykf_3 | 0 | 1 | 0.41 | 0.44 | 0.00 | 0.00 | 0.15 | 0.99 | 1.00 | ▇▁▁▁▅ |
| hydrogeologykf_9 | 0 | 1 | 0.22 | 0.36 | 0.00 | 0.00 | 0.00 | 0.32 | 1.00 | ▇▁▁▁▂ |
| hydrogeologykf_10 | 0 | 1 | 0.09 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologykf_4 | 0 | 1 | 0.06 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologykf_6 | 0 | 1 | 0.03 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologykf_99 | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.82 | ▇▁▁▁▁ |
| hydrogeologykf_11 | 0 | 1 | 0.02 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologykf_12 | 0 | 1 | 0.05 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologykf_0 | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologykf_5 | 0 | 1 | 0.06 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologykf_2 | 0 | 1 | 0.04 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologykf_7 | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.95 | ▇▁▁▁▁ |
| hydrogeologykf_8 | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.53 | ▇▁▁▁▁ |
| hydrogeologyga_S | 0 | 1 | 0.90 | 0.28 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| hydrogeologyga_G | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.82 | ▇▁▁▁▁ |
| hydrogeologyga_kA | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologyga_Me | 0 | 1 | 0.05 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hydrogeologyga_Ma | 0 | 1 | 0.04 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| elevation_elevation | 0 | 1 | 178.52 | 200.25 | -3.75 | 42.90 | 89.15 | 283.62 | 1846.03 | ▇▂▁▁▁ |
| slope_slope | 0 | 1 | 2.14 | 3.38 | 0.00 | 0.15 | 0.57 | 2.72 | 43.65 | ▇▁▁▁▁ |
| aspect_aspect | 0 | 1 | 178.40 | 27.61 | 35.84 | 162.94 | 179.11 | 194.25 | 301.42 | ▁▁▇▃▁ |
| mohplp_streamorder1 | 12 | 1 | 5235.34 | 2961.90 | 0.00 | 2762.00 | 5273.00 | 7896.00 | 10030.00 | ▆▆▆▇▇ |
| mohplp_streamorder2 | 5 | 1 | 5428.72 | 2906.28 | 0.00 | 3045.00 | 5574.00 | 7998.00 | 10030.00 | ▅▆▆▇▇ |
| mohplp_streamorder3 | 3 | 1 | 5505.89 | 2920.10 | 0.00 | 3068.35 | 5735.00 | 8137.00 | 10030.00 | ▅▆▆▆▇ |
| mohplp_streamorder4 | 3 | 1 | 5486.75 | 2934.16 | 0.00 | 3061.00 | 5682.67 | 8145.00 | 10007.00 | ▅▆▆▆▇ |
| mohplp_streamorder5 | 3 | 1 | 5444.35 | 3002.40 | 0.00 | 2881.00 | 5758.00 | 8150.00 | 10001.00 | ▆▅▅▆▇ |
| mohplp_streamorder6 | 3 | 1 | 5543.70 | 2952.94 | 0.00 | 3123.00 | 5870.00 | 8180.00 | 10000.00 | ▅▅▆▆▇ |
| mohplp_streamorder7 | 3 | 1 | 5314.65 | 3058.03 | 0.00 | 2657.00 | 5427.00 | 8128.00 | 10000.00 | ▆▅▆▆▇ |
| mohplp_streamorder8 | 3 | 1 | 7679.21 | 1413.66 | 385.00 | 6760.00 | 7644.00 | 8814.00 | 9999.00 | ▁▁▂▇▇ |
| mohpdsd_streamorder1 | 12 | 1 | 5235.34 | 2961.90 | 0.00 | 2762.00 | 5273.00 | 7896.00 | 10030.00 | ▆▆▆▇▇ |
| mohpdsd_streamorder2 | 5 | 1 | 5428.72 | 2906.28 | 0.00 | 3045.00 | 5574.00 | 7998.00 | 10030.00 | ▅▆▆▇▇ |
| mohpdsd_streamorder3 | 3 | 1 | 5505.89 | 2920.10 | 0.00 | 3068.35 | 5735.00 | 8137.00 | 10030.00 | ▅▆▆▆▇ |
| mohpdsd_streamorder4 | 3 | 1 | 5486.75 | 2934.16 | 0.00 | 3061.00 | 5682.67 | 8145.00 | 10007.00 | ▅▆▆▆▇ |
| mohpdsd_streamorder5 | 3 | 1 | 5444.35 | 3002.40 | 0.00 | 2881.00 | 5758.00 | 8150.00 | 10001.00 | ▆▅▅▆▇ |
| mohpdsd_streamorder6 | 3 | 1 | 5543.70 | 2952.94 | 0.00 | 3123.00 | 5870.00 | 8180.00 | 10000.00 | ▅▅▆▆▇ |
| mohpdsd_streamorder7 | 3 | 1 | 5314.65 | 3058.03 | 0.00 | 2657.00 | 5427.00 | 8128.00 | 10000.00 | ▆▅▆▆▇ |
| mohpdsd_streamorder8 | 3 | 1 | 7679.21 | 1413.66 | 385.00 | 6760.00 | 7644.00 | 8814.00 | 9999.00 | ▁▁▂▇▇ |
The first three rows of the data set containing the features are shown in table 2.4 as an example. This table was then joined with the table holding the target variables by station_id.
| station_id | sampledepth_sampledepth | lulc_agriculturalareas | lulc_forestandseminaturalareas | lulc_artificialsurfaces | lulc_waterbodies | lulc_wetlands | gwrecharge_gwrecharge | seepage_seepage | temperature_temperature | precipitation_precipitation | hydrounits_14 | hydrounits_13 | hydrounits_17 | hydrounits_15 | hydrounits_63 | hydrounits_31 | hydrounits_62 | hydrounits_101 | hydrounits_41 | hydrounits_64 | hydrounits_65 | hydrounits_0 | hydrounits_66 | hydrounits_95 | hydrounits_71 | hydrounits_92 | hydrounits_96 | hydrounits_32 | hydrounits_52 | hydrounits_12 | hydrounits_11 | hydrounits_81 | hydrounits_33 | hydrounits_54 | hydrounits_51 | hydrounits_16 | hydrounits_83 | hydrounits_22 | hydrounits_21 | hydrounits_53 | hydrounits_23 | hydrounits_61 | hydrounits_82 | hydrounits_94 | hydrounits_91 | hydrounits_93 | hydrounits_97 | geology_114 | geology_111 | geology_113 | geology_115 | geology_223 | geology_231 | geology_232 | geology_233 | geology_600 | geology_120 | geology_221 | geology_130 | geology_330 | geology_230 | geology_400 | geology_222 | geology_500 | geology_510 | geology_312 | geology_320 | geology_220 | geology_112 | geology_211 | geology_210 | geology_350 | geology_360 | geology_888 | geology_333 | geology_332 | geology_331 | geology_300 | geology_311 | geology_212 | geology_340 | soilunits_19 | soilunits_12 | soilunits_28 | soilunits_17 | soilunits_33 | soilunits_31 | soilunits_8 | soilunits_36 | soilunits_27 | soilunits_41 | soilunits_11 | soilunits_40 | soilunits_63 | soilunits_55 | soilunits_44 | soilunits_21 | soilunits_49 | soilunits_61 | soilunits_30 | soilunits_54 | soilunits_15 | soilunits_52 | soilunits_72 | soilunits_69 | soilunits_18 | soilunits_59 | soilunits_68 | soilunits_4 | soilunits_2 | soilunits_5 | soilunits_60 | soilunits_66 | soilunits_51 | soilunits_34 | soilunits_46 | soilunits_42 | soilunits_22 | hydrogeologygc_s | hydrogeologygc_Gew | hydrogeologygc_m | hydrogeologygc_a | hydrogeologygc_so | hydrogeologygc_kA | hydrogeologygc_k | hydrogeologygc_g | hydrogeologygc_h | hydrogeologygc_gh | hydrogeologykf_3 | hydrogeologykf_9 | hydrogeologykf_10 | hydrogeologykf_4 | hydrogeologykf_6 | hydrogeologykf_99 | hydrogeologykf_11 | hydrogeologykf_12 | hydrogeologykf_0 | hydrogeologykf_5 | hydrogeologykf_2 | hydrogeologykf_7 | hydrogeologykf_8 | hydrogeologyga_S | hydrogeologyga_G | hydrogeologyga_kA | hydrogeologyga_Me | hydrogeologyga_Ma | elevation_elevation | slope_slope | aspect_aspect | mohplp_streamorder1 | mohplp_streamorder2 | mohplp_streamorder3 | mohplp_streamorder4 | mohplp_streamorder5 | mohplp_streamorder6 | mohplp_streamorder7 | mohplp_streamorder8 | mohpdsd_streamorder1 | mohpdsd_streamorder2 | mohpdsd_streamorder3 | mohpdsd_streamorder4 | mohpdsd_streamorder5 | mohpdsd_streamorder6 | mohpdsd_streamorder7 | mohpdsd_streamorder8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 110_1 | 37.8 | 0.7103502 | 0.2896498 | 0.0000000 | 0 | 0 | 48.711914 | 147.50668 | 77.02868 | 566.8224 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 96.92465 | 0.2502644 | 165.2706 | 4936 | 1831 | 7135 | 9685 | 9381 | 5714 | 5603 | 9149 | 4936 | 1831 | 7135 | 9685 | 9381 | 5714 | 5603 | 9149 |
| 110_10 | 24.2 | 0.7080190 | 0.1805797 | 0.1114013 | 0 | 0 | 80.524147 | 174.22295 | 77.97757 | 559.4204 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 87.36120 | 0.6016899 | 180.6347 | 1716 | 5411 | 8938 | 8842 | 9281 | 5718 | 5557 | 8989 | 1716 | 5411 | 8938 | 8842 | 9281 | 5718 | 5557 | 8989 |
| 110_100 | 20.5 | 1.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 1.587526 | -52.47207 | 88.00000 | 535.1705 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 28.67915 | 0.0000000 | 185.9964 | 8771 | 8771 | 1481 | 1281 | 705 | 3764 | 7040 | 7561 | 8771 | 8771 | 1481 | 1281 | 705 | 3764 | 7040 | 7561 |
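Joining the feature table with the target table by station_id can be sketched as below, in Python. The sketch assumes an inner join, so stations missing from either table are dropped; the text does not state which join type was used. The rows and station IDs are made up for illustration.

```python
def join_by_station(features, targets):
    """Inner join of two lists of dict rows on the shared 'station_id' key."""
    targets_by_id = {row["station_id"]: row for row in targets}
    joined = []
    for row in features:
        match = targets_by_id.get(row["station_id"])
        if match is not None:
            # Merge the feature columns and target columns of one station.
            joined.append({**row, **match})
    return joined


# Hypothetical rows: station "B" has features but no target values.
features = [
    {"station_id": "A", "sampledepth_sampledepth": 37.8},
    {"station_id": "B", "sampledepth_sampledepth": 24.2},
]
targets = [
    {"station_id": "A", "ca_mg_l": 44.7},
    {"station_id": "C", "ca_mg_l": 82.9},
]
model_table = join_by_station(features, targets)
```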
As with the data preprocessing, all modeling was done in the R programming language, using the tidymodels package.
The modeling pipeline includes the following steps:
Hyperparameter tuning of min_n, tree_depth, learn_rate and loss_reduction for an XGBoost learner (number of trees fixed to 1000) with 50 parameter combinations using a space-filling parameter grid.

| parameter | min_n | tree_depth | learn_rate | loss_reduction | .config |
|---|---|---|---|---|---|
| Ca | 34 | 10 | 0.0145681 | 0.0036223 | Preprocessor1_Model48 |
| Cl | 7 | 9 | 0.0315631 | 0.0000001 | Preprocessor1_Model24 |
| Fe | 37 | 14 | 0.0029476 | 1.9521420 | Preprocessor1_Model33 |
| HCO<sub>3</sub> | 34 | 10 | 0.0145681 | 0.0036223 | Preprocessor1_Model48 |
| K | 12 | 8 | 0.0278855 | 0.0000000 | Preprocessor1_Model20 |
| Mg | 7 | 9 | 0.0315631 | 0.0000001 | Preprocessor1_Model24 |
| Mn | 37 | 14 | 0.0029476 | 1.9521420 | Preprocessor1_Model33 |
| Na | 3 | 6 | 0.0067682 | 0.0000005 | Preprocessor1_Model01 |
| NO<sub>3</sub> | 34 | 10 | 0.0145681 | 0.0036223 | Preprocessor1_Model48 |
| SO<sub>4</sub> | 7 | 9 | 0.0315631 | 0.0000001 | Preprocessor1_Model24 |
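To illustrate what a space-filling grid with 50 candidates looks like, the sketch below draws a Latin hypercube sample in Python. This is only one kind of space-filling design; the text does not say which design was used, and the search ranges and function name here are hypothetical, not the ones used in the study.

```python
import random


def latin_hypercube(n_points, ranges, log_scale=(), seed=0):
    """Draw n_points candidates so that each parameter's range is split into
    n_points equal strata and every stratum is used exactly once."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        # One uniform draw inside each stratum, then shuffle the strata order.
        unit = [(i + rng.random()) / n_points for i in range(n_points)]
        rng.shuffle(unit)
        if name in log_scale:
            columns[name] = [low * (high / low) ** u for u in unit]
        else:
            columns[name] = [low + (high - low) * u for u in unit]
    return [{name: columns[name][i] for name in ranges} for i in range(n_points)]


# Hypothetical search ranges (assumption; the actual ranges are not given).
ranges = {
    "min_n": (2, 40),
    "tree_depth": (1, 15),
    "learn_rate": (1e-3, 1e-1),
    "loss_reduction": (1e-10, 10.0),
}
grid = latin_hypercube(50, ranges, log_scale=("learn_rate", "loss_reduction"))
for candidate in grid:
    # Integer-valued hyperparameters are rounded after sampling.
    candidate["min_n"] = round(candidate["min_n"])
    candidate["tree_depth"] = round(candidate["tree_depth"])
```

Each of the 50 candidates would then be evaluated by resampling for every target, and the best-performing combination kept, which is what the .config column in the table above identifies.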
…or shown separately for each parameter as bar chart:
This study aims to set up a benchmark model as a basis for the further development of more sophisticated machine learning models. The relatively simple modeling approach shows that model performance varies strongly between the target variables, indicating that further features are required to reflect the geophysical characteristics that play a role in the processes driving the concentration of the target variables. This includes the following steps: